This procedure applies the DESeq2 method to normalize RNA-seq read count data. It estimates a size factor for each RNA-seq sample from a read count matrix and uses the size factors to normalize and transform the matrix. Between-sample variance is then compared before vs. after normalization. All samples are treated as one group during normalization.
Demo data of DESeq2 normalization
Randomly generated
The read count matrix includes 4 samples and 1000 variables/genes. The average read count per sample is 56.3 and the average read count per variable/gene is 24.
Table 1. Summary statistics of read counts per sample, before (B) and after (A) normalization by size factor. (Total: read count summation of all variables/genes; Mean/Median: read count mean/median of all variables/genes; and Variance: standard deviation of read counts after log2(count+1) transformation).
| | Total_B | Total_A | Mean_B | Mean_A | Median_B | Median_A | Variance_B | Variance_A | Size_Factor |
|---|---|---|---|---|---|---|---|---|---|
| sample1 | 39563 | 52108.53 | 39.56 | 52.11 | 15 | 19.76 | 2.11 | 2.21 | 0.759 |
| sample2 | 60312 | 54469.38 | 60.31 | 54.47 | 22 | 19.87 | 2.24 | 2.21 | 1.107 |
| sample3 | 48843 | 52005.92 | 48.84 | 52.01 | 21 | 22.36 | 2.24 | 2.26 | 0.939 |
| sample4 | 76336 | 54832.88 | 76.34 | 54.83 | 30 | 21.55 | 2.34 | 2.23 | 1.392 |
The size factors of all samples range between 0.759 and 1.392 (geometric mean = 1.024). In general, we expect a positive correlation between the total read count of a sample and its size factor, and we expect the geometric mean of the size factors to be close to 1.0.
Figure 1. Relationship between read counts before normalization and size factors. Each point represents a sample.
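The size factor estimation described above can be sketched in base R as follows. This is a minimal illustration of the median-of-ratios approach, not the code used to produce this report: the function name `median_ratio_size_factors` and the simulated counts are assumptions made for the example. In DESeq2 itself, the corresponding calls are `estimateSizeFactors()` on a `DESeqDataSet` or `estimateSizeFactorsForMatrix()` on a count matrix.

```r
# Median-of-ratios size factors for a gene-by-sample count matrix (base R only).
# Hypothetical helper written for illustration.
median_ratio_size_factors <- function(counts) {
  stopifnot(is.matrix(counts), all(counts >= 0))
  # Per-gene log geometric mean across samples; -Inf if any count is zero.
  log_geo_means <- rowMeans(log(counts))
  usable <- is.finite(log_geo_means)
  stopifnot(any(usable))
  # Size factor = median ratio of a sample's counts to the gene-wise geometric means.
  apply(counts, 2, function(x) exp(median(log(x[usable]) - log_geo_means[usable])))
}

# Simulated demo data (an assumption): 1000 genes, 4 samples of differing depth.
set.seed(1)
depth <- c(sample1 = 0.7, sample2 = 1.1, sample3 = 0.9, sample4 = 1.4)
base_mean <- rexp(1000, rate = 1 / 50)
counts <- sapply(depth, function(d) rnbinom(1000, mu = base_mean * d + 1, size = 5))
rownames(counts) <- paste0("gene", seq_len(nrow(counts)))

size_factors <- median_ratio_size_factors(counts)
geo_mean_sf <- exp(mean(log(size_factors)))          # geometric mean of size factors
total_vs_sf <- cor(colSums(counts), size_factors)    # total count vs. size factor

print(round(size_factors, 3))
print(round(c(geometric_mean = geo_mean_sf, correlation = total_vs_sf), 3))
```

Genes with a zero count in any sample are excluded because their geometric mean is zero, which mirrors how the median-of-ratios method is usually applied.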
Variables/genes contribute differently to the calculation of size factors, with those having higher read counts carrying more weight. In most RNA-seq data, rRNAs (ribosomal RNAs) and some ‘housekeeping’ genes have the highest read counts, but they are usually not the focus of research interest and are often subject to systematic bias that does not affect most other genes, such as the efficiency of rRNA depletion. The actual impact of the top variables/genes on the size factors can be evaluated by removing them from the calculation one by one.
Table 2. Size factors after each step of removing top variables/genes.
| | Original | Step1 | Step2 | Step3 | Step4 | Step5 | Step6 | Step7 | Step8 | Step9 | Step10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| sample1 | 0.759 | 0.759 | 0.760 | 0.761 | 0.761 | 0.762 | 0.761 | 0.766 | 0.770 | 0.769 | 0.771 |
| sample2 | 1.107 | 1.107 | 1.110 | 1.107 | 1.107 | 1.093 | 1.101 | 1.092 | 1.092 | 1.092 | 1.090 |
| sample3 | 0.939 | 0.939 | 0.940 | 0.939 | 0.940 | 0.944 | 0.948 | 0.948 | 0.948 | 0.949 | 0.948 |
| sample4 | 1.392 | 1.392 | 1.389 | 1.389 | 1.390 | 1.390 | 1.389 | 1.389 | 1.388 | 1.387 | 1.388 |
Figure 2. Change in the size factor of each sample as the top 5% of the variables/genes were removed from the calculation.
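The stepwise removal summarized in Table 2 can be approximated with the sketch below. It is an illustration under stated assumptions, not the report's own code: the helper names, the simulated counts, and the choice of removing 5 genes per step (10 steps covering the top 5% of 1000 genes) are all hypothetical, and genes are ranked here by their mean read count.

```r
# Median-of-ratios size factors (hypothetical helper, base R only).
median_ratio_size_factors <- function(counts) {
  log_geo_means <- rowMeans(log(counts))
  usable <- is.finite(log_geo_means)   # drop genes with a zero count in any sample
  apply(counts, 2, function(x) exp(median(log(x[usable]) - log_geo_means[usable])))
}

# Recompute size factors after cumulatively dropping the highest-count genes.
# n_steps and per_step are assumptions chosen to cover the top 5% of 1000 genes.
size_factors_by_step <- function(counts, n_steps = 10, per_step = 5) {
  stopifnot(ncol(counts) >= 2, nrow(counts) > n_steps * per_step)
  ranked <- order(rowMeans(counts), decreasing = TRUE)
  steps <- sapply(0:n_steps, function(k) {
    # k == 0 keeps every gene; otherwise drop the top k * per_step genes.
    keep <- if (k == 0) ranked else ranked[-seq_len(k * per_step)]
    median_ratio_size_factors(counts[keep, , drop = FALSE])
  })
  colnames(steps) <- c("Original", paste0("Step", seq_len(n_steps)))
  steps   # rows = samples, columns = removal steps
}

# Simulated demo data (an assumption): 1000 genes, 4 samples of differing depth.
set.seed(1)
depth <- c(sample1 = 0.7, sample2 = 1.1, sample3 = 0.9, sample4 = 1.4)
base_mean <- rexp(1000, rate = 1 / 50)
counts <- sapply(depth, function(d) rnbinom(1000, mu = base_mean * d + 1, size = 5))

step_table <- size_factors_by_step(counts)
print(round(step_table, 3))
```

If the size factors barely move across the columns, as in Table 2, the top variables/genes have little influence on the normalization of that data set.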
Normalization generally reduces variance between samples, which can be measured by comparing data distribution and calculating sample-sample variance.
Figure 3. Comparison of the two samples with the lowest (sample1) and the highest (sample4) total read counts, before vs. after normalization.
Figure 4. Distribution of read counts before and after normalization. Read counts were log2-transformed.
Figure 5. Relationship between mean read count (log2-transformed) and between-sample variance.
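The before/after comparison behind Figures 3 to 5 can be sketched as follows: divide each sample's counts by its size factor, apply the log2(count+1) transformation, and compute the between-sample variance of each variable/gene. The helper names and the simulated counts are assumptions made for this example and are not taken from the report's template.

```r
# Median-of-ratios size factors (hypothetical helper, base R only).
median_ratio_size_factors <- function(counts) {
  log_geo_means <- rowMeans(log(counts))
  usable <- is.finite(log_geo_means)   # drop genes with a zero count in any sample
  apply(counts, 2, function(x) exp(median(log(x[usable]) - log_geo_means[usable])))
}

# Normalize by size factor and compare per-gene between-sample variance.
normalize_and_compare <- function(counts, size_factors) {
  stopifnot(ncol(counts) == length(size_factors), all(size_factors > 0))
  normalized <- sweep(counts, 2, size_factors, "/")   # divide each column by its factor
  log_before <- log2(counts + 1)
  log_after <- log2(normalized + 1)
  list(
    normalized = normalized,
    var_before = apply(log_before, 1, var),   # variance across samples, per gene
    var_after = apply(log_after, 1, var),
    mean_log_after = rowMeans(log_after)      # x-axis of a mean-vs-variance plot
  )
}

# Simulated demo data (an assumption): 1000 genes, 4 samples of differing depth.
set.seed(1)
depth <- c(sample1 = 0.7, sample2 = 1.1, sample3 = 0.9, sample4 = 1.4)
base_mean <- rexp(1000, rate = 1 / 50)
counts <- sapply(depth, function(d) rnbinom(1000, mu = base_mean * d + 1, size = 5))

result <- normalize_and_compare(counts, median_ratio_size_factors(counts))
print(round(c(mean_var_before = mean(result$var_before),
              mean_var_after = mean(result$var_after)), 3))
# A plot in the spirit of Figure 5:
# plot(result$mean_log_after, result$var_after, xlab = "Mean log2 count", ylab = "Variance")
```

A lower average variance after normalization indicates that differences in sequencing depth, rather than per-gene variation, accounted for part of the spread between samples.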
To reproduce this report:
Find the data analysis template you want to use and an example of its paired YAML file here, then download the YAML example to your working directory.
To generate a new report using your own input data and parameters, edit the following items in the YAML file:
Run the code below in the R console or RStudio, preferably in a new R session:
if (!require(devtools)) { install.packages('devtools'); require(devtools); }
if (!require(RCurl)) { install.packages('RCurl'); require(RCurl); }
if (!require(RoCA)) { install_github('zhezhangsh/RoCAR'); require(RoCA); }
CreateReport('filename.yaml'); # replace 'filename.yaml' with the path to the YAML file you just downloaded and edited for your analysis
If no errors are reported, go to the output folder and open the index.html file to view the report.
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] DEGandMore_0.0.0.9000 snow_0.4-3
## [3] rchive_0.0.0.9000 awsomics_0.0.0.9000
## [5] colorspace_1.3-2 gplots_3.0.1
## [7] MASS_7.3-51.1 htmlwidgets_1.3
## [9] DT_0.5 yaml_2.2.0
## [11] kableExtra_0.9.0 knitr_1.20
## [13] rmarkdown_1.10 RoCA_0.0.0.9000
## [15] RCurl_1.95-4.11 bitops_1.0-6
## [17] usethis_1.4.0 devtools_2.0.1
## [19] DESeq2_1.22.1 SummarizedExperiment_1.12.0
## [21] DelayedArray_0.8.0 BiocParallel_1.16.2
## [23] matrixStats_0.54.0 Biobase_2.42.0
## [25] GenomicRanges_1.34.0 GenomeInfoDb_1.18.1
## [27] IRanges_2.16.0 S4Vectors_0.20.1
## [29] BiocGenerics_0.28.0
##
## loaded via a namespace (and not attached):
## [1] fs_1.2.6 bit64_0.9-7 httr_1.3.1
## [4] RColorBrewer_1.1-2 rprojroot_1.3-2 tools_3.5.1
## [7] backports_1.1.2 R6_2.3.0 KernSmooth_2.23-15
## [10] rpart_4.1-13 Hmisc_4.1-1 DBI_1.0.0
## [13] lazyeval_0.2.1 nnet_7.3-12 withr_2.1.2
## [16] gridExtra_2.3 prettyunits_1.0.2 processx_3.2.0
## [19] bit_1.1-14 compiler_3.5.1 rvest_0.3.2
## [22] cli_1.0.1 htmlTable_1.12 xml2_1.2.0
## [25] desc_1.2.0 caTools_1.17.1.1 scales_1.0.0
## [28] checkmate_1.8.5 readr_1.2.1 genefilter_1.64.0
## [31] callr_3.0.0 stringr_1.3.1 digest_0.6.18
## [34] foreign_0.8-71 XVector_0.22.0 pkgconfig_2.0.2
## [37] base64enc_0.1-3 htmltools_0.3.6 sessioninfo_1.1.1
## [40] highr_0.7 rlang_0.3.0.1 rstudioapi_0.8
## [43] RSQLite_2.1.1 gtools_3.8.1 acepack_1.4.1
## [46] magrittr_1.5 GenomeInfoDbData_1.2.0 Formula_1.2-3
## [49] Matrix_1.2-15 Rcpp_1.0.0 munsell_0.5.0
## [52] stringi_1.2.4 zlibbioc_1.28.0 pkgbuild_1.0.2
## [55] plyr_1.8.4 grid_3.5.1 blob_1.1.1
## [58] gdata_2.18.0 crayon_1.3.4 lattice_0.20-38
## [61] splines_3.5.1 annotate_1.60.0 hms_0.4.2
## [64] locfit_1.5-9.1 ps_1.2.1 pillar_1.3.0
## [67] geneplotter_1.60.0 pkgload_1.0.2 XML_3.98-1.16
## [70] glue_1.3.0 evaluate_0.12 latticeExtra_0.6-28
## [73] data.table_1.11.8 remotes_2.0.2 gtable_0.2.0
## [76] assertthat_0.2.0 ggplot2_3.1.0 xtable_1.8-3
## [79] viridisLite_0.3.0 survival_2.43-3 tibble_1.4.2
## [82] AnnotationDbi_1.44.0 memoise_1.1.0 cluster_2.0.7-1